{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Structural Diversity Assessment"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "DeepTCR can be used to assess the structural diversity within a repertoire. One can think of this structural diversity assessment as being a measure of the number of antigens or concepts within a repertoire. A repertoire that has low diversity is recognizing few antigens while one with a high diversity can recognize more. We will assess two measures of diversity. The first one is the number of clusters/concepts of TCR sequences are present in each sample/repertoire. And the second is how entropic is the distribution of the sequence reads across these clusters. For example, a sample can have 12 clusters but if 90% of the repertoire is is one cluster, this would be a low entropic repertoire."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "First, we will load data and train the VAE."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%capture\n",
    "import sys\n",
    "sys.path.append('../../')\n",
    "from DeepTCR.DeepTCR import DeepTCR_U\n",
    "\n",
    "# Instantiate training object\n",
    "DTCRU = DeepTCR_U('Tutorial')\n",
    "\n",
    "#Load Data from directories\n",
    "DTCRU.Get_Data(directory='../../Data/Murine_Antigens',Load_Prev_Data=False,aggregate_by_aa=True,\n",
    "               aa_column_beta=0,count_column=1,v_beta_column=2,j_beta_column=3)\n",
    "\n",
    "#Train VAE\n",
    "DTCRU.Train_VAE(Load_Prev_Data=False)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We will then execute the following command to generate diversity measurements."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Finding 30 nearest neighbors using minkowski metric and 'auto' algorithm\n",
      "Neighbors computed in 2.4685208797454834 seconds\n",
      "Jaccard graph constructed in 1.3048810958862305 seconds\n",
      "Wrote graph to binary file in 0.25426173210144043 seconds\n",
      "Running Louvain modularity optimization\n",
      "After 1 runs, maximum modularity is Q = 0.892024\n",
      "Louvain completed 21 runs in 3.1881520748138428 seconds\n",
      "PhenoGraph complete in 7.2290565967559814 seconds\n"
     ]
    }
   ],
   "source": [
    "DTCRU.Structural_Diversity()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Because the first step of this algorithm is to cluster the data via the phenograph algorithm, we can also change the sample parameter to sub-sample for our initial clustering before applying a K-nearest neighbors algorithm to assign the rest of the sequences. This is helpful in the case of very large TCRSeq file (like those collected in from Tumor-Infiltrating Lymphocytes (TIL))."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Finding 30 nearest neighbors using minkowski metric and 'auto' algorithm\n",
      "Neighbors computed in 0.12893915176391602 seconds\n",
      "Jaccard graph constructed in 0.27338743209838867 seconds\n",
      "Wrote graph to binary file in 0.029012203216552734 seconds\n",
      "Running Louvain modularity optimization\n",
      "After 1 runs, maximum modularity is Q = 0.678959\n",
      "Louvain completed 21 runs in 2.3945066928863525 seconds\n",
      "PhenoGraph complete in 2.834502696990967 seconds\n"
     ]
    }
   ],
   "source": [
    "DTCRU.Structural_Diversity(sample=500)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can then view the structural diversity metrics in the respective object variable."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "                  Sample    Class   Entropy  Num of Clusters\n",
      "0              Db-F2.tsv    Db-F2  1.743952                9\n",
      "1             Db-M45.tsv   Db-M45  1.927988               10\n",
      "2              Db-NP.tsv    Db-NP  1.664511               10\n",
      "3              Db-PA.tsv    Db-PA  1.685198               10\n",
      "4             Db-PB1.tsv   Db-PB1  1.465170               10\n",
      "5             Kb-M38.tsv   Kb-M38  1.190016               10\n",
      "6      Kb-5_TILS_SIY.tsv   Kb-SIY  1.774177               10\n",
      "7    Kb-1_Sp_Con_SIY.tsv   Kb-SIY  2.091482               10\n",
      "8     Kb-4_Sp_T_SIYI.tsv   Kb-SIY  2.043941               10\n",
      "9       Kb-6_dLN_SIY.tsv   Kb-SIY  2.051375               10\n",
      "10    Kb-3_Sp_T_TRP2.tsv  Kb-TRP2  1.706198               10\n",
      "11  Kb-2_Sp_Con_TRP2.tsv  Kb-TRP2  1.955121               10\n",
      "12     Kb-7_dLN_TRP2.tsv  Kb-TRP2  2.138611               10\n",
      "13           Kb-m139.tsv  Kb-m139  2.136927               10\n"
     ]
    }
   ],
   "source": [
    "print(DTCRU.Structural_Diversity_DF)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can see the entropy and number of clusters for each sample."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.7"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
